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Abstract. The queries defined on data warehouses are complex and use 
several join operations that induce an expensive computational cost. This 
cost becomes even more prohibitive when queries access very large vol- 
umes of data. To improve response time, data warehouse administrators 
generally use indexing techniques such as star join indexes or bitmap join 
indexes. This task is nevertheless complex and fastidious. Our solution 
lies in the field of data warehouse auto-administration. In this frame- 
work, we propose an automatic index selection strategy. We exploit a 
data mining technique ; more precisely frequent itemset mining, in order 
to determine a set of candidate indexes from a given workload. Then, 
we propose several cost models allowing to create an index configuration 
composed by the indexes providing the best profit. These models evalu- 
ate the cost of accessing data using bitmap join indexes, and the cost of 
updating and storing these indexes. 



1 Introduction 

Data warehouses are generally modelled according to a star schema that con- 
tains a central, large fact table, and several dimension tables that describe the 
facts [10, 11]. The fact table contains the keys of the dimension tables (foreign 
keys) and measures. A decision-support query on this model needs one or more 
joins between the fact table and the dimension tables. These joins induce an 
expensive computational cost. This cost becomes even more prohibitive when 
queries access very large data volumes. It is thus crucial to reduce it. 

Several database techniques have been proposed to improve the computa- 
tional cost of joins, such as hash join, merge join and nested loop join [14]. How- 
ever, these techniques are efficient only when a join applies on two tables and data 
volume is relatively small. When the number of joins is greater than two, they are 
ordered depending on the joined tables (join order problem). Other techniques, 
used in the data warehouse environment, exploit join indexes to pre-compute 
these joins in order to ensure fast data access. Data warehouse administrators 
then handle the crucial task of choosing the best indexes to create (index selec- 
tion problem). This problem has been studied for many years in databases [1, 



4-7,12,18]. However, it remains largely unresolved in data warehouses. Exist- 
ing research studies may be clustered in two families: algorithms that optimize 
maintenance cost [13] and algorithms that optimize query response time [2, 8, 9]. 
In both cases, optimization is realized under the constraint of the storage space. 
In this paper, we focus on the second family of solutions, which is relevant in 
our context because they aim to optimize query response time. 

In addition, with the large scale usage of databases in general and data 
warehouses in particular, it is now very important to reduce the database ad- 
ministration function. The aim of auto-administrative systems is to administrate 
and adapt themselves automatically, without loss (or even with a gain) in per- 
formance. In this context, we proposed a method for index selection in databases 
based on frequent itemset extraction from a given workload [3]. In this paper, we 
present the follow-up of this work. Since all candidate indexes provided by the 
frequent itemset extraction phase cannot be built in practice due to system and 
storage space constraints, we propose a cost model-based strategy that selects 
the most advantageous indexes. Our cost models estimate the data access cost 
using bitmap join indexes, and their maintenance and storage cost. 

We particularly focus on bitmap join indexes because they are well-adapted 
to data warehouses. Bitmap indexes indeed make the execution of several com- 
mon operations such as And, Or, Not or Count efficient by having them operating 
on bitmaps, in memory, and not on the original data. Furthermore, joins are pre- 
computcd at index creation time and not at query execution time. The storage 
space occupied by bitmaps is also low, especially when the indexed attribute 
cardinality is not high [17, 19]. Such attributes are frequently used in decision- 
support query clauses such as Where and Group by. 

The remainder of this paper is organized as follows. We first remind the 
principle of our index selection method based on frequent itemset mining (Sec- 
tion 2). Then, we detail our cost models (Section 3) and our index selection 
strategy (Section 4). To validate our work, we also present some experiments 
(Section 5). We finally conclude and provide research perspectives (Section 6). 

2 Index selection method 

In this section, we present an extension to our work about the index selection 
problem [3]. The method we propose (Figure 1) exploits the transaction log 
(the set of all the queries processed by the system) to recommend an index 
configuration improving data access time. 

We first extract from a given workload a set of so called indexable at- 
tributes. Then, we build a "query-attribute" matrix whose rows represent work- 
load queries and whose columns represent a set of all the indexable attributes. 
Attribute presence in a query is symbolized by one, and absence by zero. It is 
then exploited by the Close frequent itemset mining algorithm [16]. Each itemset 
is analyzed to generate a set of candidate indexes. This is achieved by exploiting 
the data warehouse metadata (schema: primary keys, foreign keys; statistics. . . ). 
Finally, we prune the candidate indexes using the cost models presented in Sec- 
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Fig. 1. Automatic index selection strategy 
Table 1. Cost model parameters 



Symbol 


Description 


1*1 


Number of tuples in table X or cardinality of attribute X 


Sp 


Disk page size in bytes 


px 


Number of pages needed to store table X 


Spointer 


Page pointer size in bytes 


m 


B-tree order 


d 


Number of bitmaps used to evaluate a given query 


w(X) 


Tuple size in bytes of table X or attribute X 



tion 3, before effectively building a pertinent index configuration. We detail these 
steps in the following sections. 



3 Cost models 



The number of candidate indexes is generally as high as the input workload is 
large. Thus, it is not feasible to build all the proposed indexes because of system 
limits (limited number of indexes per table) or storage space constraints. To 
circumvent these limitations, we propose cost models allowing to conserve only 
the most advantageous indexes. These models estimate the storage space (in 
bytes) occupied by bitmap join indexes, the data access cost using these indexes 
and their maintenance cost expressed in number of input/output operations 
(I/Os). Table 1 summarizes the notations used in our cost models. 



3.1 Bitmap join index size 



The space required to store a simple bitmap index linearly depends on the in- 
dexed attribute cardinality and the number of tuples in the table on which the 
index is built. The storage space of a bitmap index built on attribute A from ta- 
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ble T is equal to g ' bytes [19, 20]. Bitmap join indexes are built on dimension 
table attributes. Each bitmap contains as many bits as the number of tuples in 
fact table F. The size of their storage space is then S = ^ F bytes. 

3.2 Bitmap join index maintenance cost 

Data updates (mainly insert operations in decisions-support systems) systemi- 
cally trigger index updates. These operations are applied either on a fact table 
or dimensions. The cost of updating bitmap join indexes is presented in the 
following sections. 

Insertion cost in fact table. Assume a bitmap join index built on attribute 
A from dimension table T. While inserting tuples in fact table F, it is first 
necessary to search for the tuple of T that is able to be joined with them. At 
worst, the whole table T is scanned (Pt pages are read). It is then necessary 
to update all bitmaps. At worst, all bitmaps are scanned: pages are read, 

where S p denotes the size of one disk page. The index maintenance cost is then 

r — i \ a \\f\ 

^maintenance PT i SS P ' 

Insertion cost in dimension tables. An insertion in dimension T may induce 
or not a domain expansion for attribute A. When not expanding the domain, the 
fact table is scanned to search for tuples that are able to be joined with the new 
tuple inserted in T. This operation requires to read pf pages. It is then necessary 
to update the bitmap index. This requires I/Os. When expanding the 

domain, it is necessary to add the cost of building a new bitmap pages). 
The maintenance cost of bitmap join indexes is then C ma intenance = Pf + (1 + 
£) ' wnere £ i s equal to one if there is expansion and zero otherwise. 

3.3 Data access cost 

We propose two cost models to estimate the number of I/Os needed for data 
access. In the first model, we do not take any hypothesis about how indexes are 
physically implemented. In the second model, we assume that access to the index 
bitmaps is achieved through a b-tree such as is the case in Oracle. Due to lack 
of space and our experiments under Oracle we only detail here the second model 
because of running our experiments, however, the first model is not detailed here 
due to the lack of space. 

In this model, we assume that the access to bitmaps is realized through a b- 
tree (meta-indexing) in which leaf nodes point to bitmaps (appendix figure ??). 
The cost, in number of I/Os, of exploiting a bitmap join index for a given query 
may be written as follows: C = C 'descent + C scan + C rea d, where C 'descent 



denotes the cost needed to reach the leaf nodes from the b-tree root, C scan 
denotes the cost of scanning leaf nodes to retrieve the right search key and the 
cost of reading the bitmaps associated to this key, and C rea d finally gives the 
cost of reading the indexed table's tuples. 

The descent cost in the b-tree depends on its height. The b-tree's height built 
on attribute A is log m \A\, where m is the b-tree's order. This order is equal to 
K + 1, where K represents the number of search keys in each b-tree node. K is 
equal to w ^ A ^g . - , where w(A) and S po i n t e r are respectively the size of the 
indexed attribute A and the size of a disk page pointer in bytes. Without adding 
the b-tree leaf node level, the b-tree descent cost is then C 'descent = log m \A\ — 1. 

The scanning cost of leaf nodes is -14fj (at worst, all leaf nodes are read). 
Data access is achieved through bits set to one in each bitmap. In this case, it 
is necessary to read each bitmap. The reading cost of d bitmaps is d-^-. Hence, 

the scanning cost of the leaf nodes is C scan = -^L + d$-. 

The reading cost of the indexed table's tuples is computed as follow. For a 
bitmap index built on attribute A, the number of read tuples is equal to j^y (if 
data arc uniformly distributed). Generally, the total number of read tuples for 
a query using d bitmaps is N r — djjj. Knowing the number of read tuples, the 

_ N r 

number of I/Os in the reading phase is C rea d = Pf(1 — e p p ) [15], where pf 
denotes the number of pages needed for store the fact tabic. 

In summary, the evaluation cost of a query exploiting a bitmap join index is 

C index =log m \A\ - l + 

3.4 Join cost without indexes 

If the bitmap join indexes are not useful while evaluating a given query, we 
assume that all joins are achieved by the hash-join method. The number of 
I/Os needed for joining table R with table S is then Chash = 3 (ps + Pr) [14]. 

4 Bitmap join index selection strategy 

Our index selection strategy proceeds in several steps. The candidate index set 
is hrst built from the frequent itcmscts mined from the workload (Section 2). A 
greedy algorithm then exploits an objective function based on our cost models 
(Section 3) to prune the least advantageous indexes. The detail of these steps and 
the construction of the objective function are provided in the following sections. 

4.1 Candidate index set construction 

From the frequent itcmscts (Section 2) and the data warehouse schema (foreign 
keys of the fact table, primary keys of the dimensions, etc.), we build a set of 
candidate indexes. 



The SQL statement for building a bitmap join index is composed of three 
clauses: On, From and Where. The On clause is composed of attributes on which is 
built the index (non-key attributes in the dimensions), the From clause contains 
all joined tables and the Where clause contains the join predicates. 

We consider a frequent itcmsct < Table. attribute^ ...,Table.attribute n > 
composed of elements such as Table. attribute. Each itemset is analyzed to de- 
termine the different clauses of the corresponding index. We first extract the 
elements containing foreign keys of the fact table because they are necessary to 
define the From and Where index clauses. Next, we retrieve the itemset elements 
that contain primary keys of dimensions to form the From index clause. The 
elements containing non-key attributes of dimensions form the On index clause. 
If such elements do not exist, the bitmap join index cannot be built. 

4.2 Objective functions 

In this section, we describe three objective functions to evaluate the variation of 
query execution cost, in number of I/Os, induced by adding a new index. The 
query execution cost is assimilated to computing the cost of hash joins if no 
bitmap join index is used or to the data access cost through indexes otherwise. 
The workload execution cost is obtained by adding all execution costs for each 
query within this workload. The first objective function advantages the indexes 
providing more profit while executing queries, the second one advantages the 
indexes providing more benefit and occupying less storage space, and the third 
one combines the first two in order to select at first all indexes providing more 
profit and then keep only those occupying less storage space when this resource 
becomes critical. The first function is useful when storage space is not limited, the 
second one is useful when storage space is small and the third one is interesting 
when this storage space is quite large. The detail of computing each function is 
not given due to the lack of space. 

4.3 Index configuration construction 

The index selection algorithm is based on a greedy search within the candidate 
index set / given as an input. The objective function F must be one of the 
functions: profit (P), profit/space ratio (R) or hybrid (H). If R is used, we add 
to the algorithm's input the space storage M allotted for indexes. If H is used, 
we also add threshold a as input. 

In the first algorithm iteration, the values of the objective function are com- 
puted for each index within /. The execution cost of all queries in workload Q 
is equal to the total cost of hash joins. The index i max that maximizes F, if it 
exists, is then added to the set of selected indexes S. If R or H is used, the whole 
space storage M is decreased by the amount of space occupied by i ma x- 

The function values of F are then recomputed for each remaining index in 
I — S since they depend on the selected indexes present in 5*. This helps taking 
into account the interactions that probably exist between the indexes. We repeat 
these iterations until there is no improvement or all indexes have been selected 




Fig. 2. Profit function Fig. 3. Index storage space 



(/ — S = 0) . If functions R or H are used, the algorithm also stops when storage 
space is full. 

5 Experiments 

In order to validate our bitmap join index selection strategy, we have run tests 
on a data warehouse implemented within Oracle 9i, on a Pentium 2.4 GHz PC 
with a 512 MB main memory and a 120 GB IDE disk. This data warehouse 
is composed of the fact table Sales and five dimensions Customers, Products, 
Promotions, Times and Channels. We have measured for different value of the 
minimal support parameterized in Close the workload execution time. In prac- 
tice, the minimal support limits the number of candidate indexes to generate 
and selects only those that are frequently used. 

For computing the different costs from our models, we fixed the value of S p 
(disk page size) and S po i n ter (page pointer size) to 8 MB and 4 MB respectively. 
These values are those indicated in the Oracle 9i configuration file. The work- 
load is composed of forty decision-support queries containing several joins. We 
measured the total execution time when building indexes or not. In the case of 
building indexes, we also measured the total execution time when we applied 
each objective function among of profit, ratio profit/space and hybrid. We also 
measured the disk space occupied by the selected indexes. When applying the 
cost models, we reduce the number of indexes and thereby the storage space 
needed to store these indexes. 

Profit function experiment. Figure 2 shows that the selected indexes im- 
prove query execution time with and without application of our cost models 
until the minimal support forming frequent itemsets reaches 47.5%. Moreover, 
the execution time decreases continuously when the minimal support increases 
because the number of indexes decreases. For high values of the minimal support 
(greater than 47.5%), the execution time is closer to the one obtained without 



indexes. This case is predictable because there is no or few candidate indexes 
to create. The maximal gain in time in both cases is respectively 30.50% and 
31.85%. Despite of this light drop of 1.35% in time gain when the cost models 
are used (fewer indexes are built), we observe a significant gain in storage space 
(equal to 32.79% in the most favorable case) as shown in figure 3. This drop 
in number of indexes is interesting when the data warehouse update frequency 
is high because update time is proportional to the number of indexes. On the 
other hand, the gain in storage space helps limiting the storage space allotted 
for indexes by the administrator. 

Profit/space ratio function experiment. In these experiments, we have fixed 
the value of minimal support to 1%. This value gives the highest number of 
frequent itemsets and consequently the highest number of candidate indexes. 
This helps varying storage space within a wider interval. We have measured 
query execution time according to the percentage of storage space allotted for 
indexes. This percentage is computed from the space occupied by all indexes. 
Figure 4 shows that execution time decreases when storage space occupation 
increases. This is predictable because we create more indexes and thus better 
improve the execution time. We also observe that the maximal time gain is equal 
to 28.95% and it is reached for a space occupation of 59.64%. This indicates 
that if we fix space storage to this value, we obtain a time gain close to the one 
obtained with the profit objective function (30.50%). This case is interesting 
when the administrator does not have enough space to store all the indexes. 
Hybrid function experiment. We repeated the previous experiments with 
the hybrid objective function. We varied the value of parameter a between 0.1 
and 1 by 0.1 steps. The obtained results with a e [0.1, 0.7] and a <E [0.8, 1] are re- 
spectively equal to those obtained with a = 0.1 and a = 0.7. Thus, we represent 
in figure 5 only the results obtained with a = 0.1 and a = 0.7. This figure shows 
that for a = 0.1, the results are close to those obtained with profit /space ratio 
the function ; and for a = 0.8, they are close to those obtained with the profit 
function. The maximal gain in execution time is respectively equal to 28.95% 
and 29.95% for a = 0.1 and a = 0.8. We explain these results by the fact that 
bitmap join indexes built on several attributes need more storage space. How- 
ever, as they pre-compute more joins, they better improve the execution time. 
The space storage allotted for indexes then fills up very quickly after a few it- 
erations of the greedy algorithm. This explains why the parameter a does not 
significantly affect our algorithm and the experiment results. 

6 Conclusion and perspectives 

In this article, we presented an automatic strategy for bitmap index selection 
in data warehouses. This strategy first exploits frequent itemsets obtained by 
the Close algorithm from a given workload to build a set of candidate bitmap 
join indexes. With the help of cost models, we keep only the most advantageous 
candidate indexes. These models estimate data access cost through indexes, as 
well as maintenance and storage cost for these indexes. We have also proposed 
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three objective functions: profit, profit/space ratio and hybrid that exploit our 
cost models to evaluate the execution cost of all queries. These functions are 
themselves exploited by a greedy algorithm that recommends a pertinent con- 
figuration of indexes. This helps our strategy respecting constraints imposed by 
the system (limited number of indexes per table) or the administrator (storage 
space allotted for indexes). Our experimental results show that the application 
of cost models to our index selection strategy decreases the number of selected 
indexes without a significant loss in performance. This decrease actually guar- 
antees a substantial gain in storage space, and thus a decrease in maintenance 
cost during data warehouse updates. 

Our work shows that the idea of using data mining techniques for data ware- 
house auto-administration is a promising approach. It opens several future re- 
search axes. First, it is essential to keep on experimenting in order to better eval- 
uate system overhead in terms of index building and maintenance. It could also 
be very interesting to compare our approach to other index selection methods. 
Second, extending our approach to other performance optimization techniques 
(materialized views, buffering, physical clustering, etc.) is another promising 
perspective. Indeed, in a data warehouse environment, it is principally in con- 
junction with other physical structures such as materialized views that indexing 
techniques provide significant gains in performance. For example, our context 
extraction may be useful to build clusters of queries that maximize the simi- 
larity between queries within each cluster. Each cluster may be then a starting 
point to materialize views. In addition, it could be interesting to design methods 
to efficiently share the available storage space between indexes and views. 
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